
Showing papers on "Document retrieval published in 2019"


Proceedings ArticleDOI
18 Jul 2019
TL;DR: This paper proposed a retrieval framework consisting of three components: question retrieval, question selection, and document retrieval, which takes into account the original query and previous question-answer interactions while selecting the next question.
Abstract: Users often fail to formulate their complex information needs in a single query. As a consequence, they may need to scan multiple result pages or reformulate their queries, which may be a frustrating experience. Alternatively, systems can improve user satisfaction by proactively asking questions of the users to clarify their information needs. Asking clarifying questions is especially important in conversational systems since they can only return a limited number of (often only one) result(s). In this paper, we formulate the task of asking clarifying questions in open-domain information-seeking conversational systems. To this end, we propose an offline evaluation methodology for the task and collect a dataset, called Qulac, through crowdsourcing. Our dataset is built on top of the TREC Web Track 2009-2012 data and consists of over 10K question-answer pairs for 198 TREC topics with 762 facets. Our experiments on an oracle model demonstrate that asking only one good question leads to over 170% retrieval performance improvement in terms of P@1, which clearly demonstrates the potential impact of the task. We further propose a retrieval framework consisting of three components: question retrieval, question selection, and document retrieval. In particular, our question selection model takes into account the original query and previous question-answer interactions while selecting the next question. Our model significantly outperforms competitive baselines. To foster research in this area, we have made Qulac publicly available.

193 citations
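
The question selection component described above conditions on both the original query and the accumulated question-answer exchanges. A minimal sketch of that selection loop, with a toy overlap heuristic standing in for the paper's learned model (all names here are illustrative):

```python
# Minimal sketch of next-question selection: prefer candidates related to
# the query but not already covered by earlier answers. `score_question`
# is a toy stand-in for the learned model described in the abstract.

def score_question(query, history, question):
    q_terms = set(question.lower().split())
    overlap = len(q_terms & set(query.lower().split()))
    redundancy = sum(len(q_terms & set(a.lower().split())) for _, a in history)
    return overlap - 0.5 * redundancy

def select_next_question(query, history, candidates):
    return max(candidates, key=lambda q: score_question(query, history, q))

history = [("Do you mean the programming language?", "yes the language")]
candidates = ["Do you mean the python programming language?",
              "Are you looking for python tutorials or reference docs?"]
print(select_next_question("python documentation", history, candidates))
```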


Journal ArticleDOI
17 Jul 2019
TL;DR: This paper presents a connected system of three homogeneous neural semantic matching models that conduct document retrieval, sentence selection, and claim verification jointly for fact extraction and verification.
Abstract: The increasing concern with misinformation has stimulated research efforts on automatic fact checking. The recently released FEVER dataset introduced a benchmark fact-verification task in which a system is asked to verify a claim using evidential sentences from Wikipedia documents. In this paper, we present a connected system consisting of three homogeneous neural semantic matching models that conduct document retrieval, sentence selection, and claim verification jointly for fact extraction and verification. For evidence retrieval (document retrieval and sentence selection), unlike traditional vector space IR models in which queries and sources are matched in some pre-designed term vector space, we develop neural models to perform deep semantic matching from raw textual input, assuming no intermediate term representation and no access to structured external knowledge bases. We also show that Pageview frequency can help improve the performance of evidence retrieval; the retrieved evidence is then matched using our neural semantic matching network. For claim verification, unlike previous approaches that simply feed upstream retrieved evidence and the claim to a natural language inference (NLI) model, we further enhance the NLI model by providing it with internal semantic relatedness scores (hence integrating it with the evidence retrieval modules) and ontological WordNet features. Experiments on the FEVER dataset indicate that (1) our neural semantic matching method outperforms popular TF-IDF and encoder models by significant margins on all evidence retrieval metrics, (2) the additional relatedness score and WordNet features improve the NLI model via better semantic awareness, and (3) by formalizing all three subtasks as a similar semantic matching problem and improving on all three stages, the complete model is able to achieve state-of-the-art results on the FEVER test set (two times greater than baseline results).

192 citations
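
A compact way to see the "three homogeneous matchers" design is as a single pipeline in which every stage scores a (claim, text) pair. A toy end-to-end sketch, with lexical overlap standing in for the neural semantic matchers (the function names and the thresholded verdict are illustrative, not the authors' code):

```python
# Toy three-stage pipeline: document retrieval -> sentence selection ->
# claim verification, each reduced to the same matching primitive.

def match(a, b):
    # Jaccard overlap as a stand-in for the neural semantic matcher.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / (len(ta | tb) or 1)

def verify(claim, corpus, k_docs=3, k_sents=5):
    docs = sorted(corpus, key=lambda d: match(claim, d["title"]), reverse=True)[:k_docs]
    scored = [(s, match(claim, s)) for d in docs for s in d["sentences"]]
    evidence = sorted(scored, key=lambda x: x[1], reverse=True)[:k_sents]
    # The paper feeds relatedness scores into the NLI model as features;
    # here a simple threshold on the best score yields a toy verdict.
    best = max((score for _, score in evidence), default=0.0)
    return ("SUPPORTS" if best > 0.5 else "NOT ENOUGH INFO",
            [s for s, _ in evidence])

corpus = [{"title": "Colin Kaepernick",
           "sentences": ["Colin Kaepernick became a free agent in 2017."]}]
print(verify("Colin Kaepernick became a free agent", corpus))
```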


Proceedings ArticleDOI
20 Aug 2019
TL;DR: This paper leverages passage-level relevance judgments fortuitously available in other domains to fine-tune BERT models that capture cross-domain notions of relevance and can be directly used for ranking news articles.
Abstract: This paper applies BERT to ad hoc document retrieval on news articles, which requires addressing two challenges: relevance judgments in existing test collections are typically provided only at the document level, and documents often exceed the length that BERT was designed to handle. Our solution is to aggregate sentence-level evidence to rank documents. Furthermore, we are able to leverage passage-level relevance judgments fortuitously available in other domains to fine-tune BERT models that are able to capture cross-domain notions of relevance, and can be directly used for ranking news articles. Our simple neural ranking models achieve state-of-the-art effectiveness on three standard test collections.

153 citations
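
One common realization of the sentence-level aggregation this abstract describes is to interpolate the first-stage retrieval score of a document with its top few BERT sentence scores. A hedged sketch; the weights are made-up hyperparameters, not the values used in the paper:

```python
# Final document score = interpolation of the first-stage retrieval score
# with a weighted sum of the strongest BERT sentence scores.

def aggregate(doc_score, sentence_scores, alpha=0.5, weights=(1.0, 0.5, 0.25)):
    top = sorted(sentence_scores, reverse=True)[:len(weights)]
    bert_evidence = sum(w * s for w, s in zip(weights, top))
    return alpha * doc_score + (1 - alpha) * bert_evidence

print(aggregate(doc_score=12.3, sentence_scores=[0.91, 0.40, 0.88, 0.12]))
```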


Posted Content
TL;DR: The authors apply inference on sentences individually and then aggregate sentence scores to produce document scores, reporting the highest average precision on these datasets among neural approaches they are aware of. This addresses the challenge posed by documents that are typically longer than the input length BERT was designed to handle.
Abstract: Following recent successes in applying BERT to question answering, we explore simple applications to ad hoc document retrieval. This required confronting the challenge posed by documents that are typically longer than the length of input BERT was designed to handle. We address this issue by applying inference on sentences individually, and then aggregating sentence scores to produce document scores. Experiments on TREC microblog and newswire test collections show that our approach is simple yet effective, as we report the highest average precision on these datasets by neural approaches that we are aware of.

117 citations


DOI
08 Jun 2019
TL;DR: This work addresses the challenge posed by documents that are typically longer than the input length BERT was designed to handle, by applying inference on sentences individually and then aggregating sentence scores to produce document scores.

113 citations


Proceedings ArticleDOI
TL;DR: This paper formulates the task of asking clarifying questions in open-domain information-seeking conversational systems, proposes an offline evaluation methodology for the task, and collects a dataset, called Qulac, through crowdsourcing; the proposed question selection model significantly outperforms competitive baselines.
Abstract: Users often fail to formulate their complex information needs in a single query. As a consequence, they may need to scan multiple result pages or reformulate their queries, which may be a frustrating experience. Alternatively, systems can improve user satisfaction by proactively asking questions of the users to clarify their information needs. Asking clarifying questions is especially important in conversational systems since they can only return a limited number of (often only one) result(s). In this paper, we formulate the task of asking clarifying questions in open-domain information-seeking conversational systems. To this end, we propose an offline evaluation methodology for the task and collect a dataset, called Qulac, through crowdsourcing. Our dataset is built on top of the TREC Web Track 2009-2012 data and consists of over 10K question-answer pairs for 198 TREC topics with 762 facets. Our experiments on an oracle model demonstrate that asking only one good question leads to over 170% retrieval performance improvement in terms of P@1, which clearly demonstrates the potential impact of the task. We further propose a retrieval framework consisting of three components: question retrieval, question selection, and document retrieval. In particular, our question selection model takes into account the original query and previous question-answer interactions while selecting the next question. Our model significantly outperforms competitive baselines. To foster research in this area, we have made Qulac publicly available.

109 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: This demonstration focuses on technical challenges in the integration of NLP and IR capabilities, along with the design rationale behind the approach to tightly-coupled integration between Python and the Java Virtual Machine.
Abstract: We present Birch, a system that applies BERT to document retrieval via integration with the open-source Anserini information retrieval toolkit to demonstrate end-to-end search over large document collections. Birch implements simple ranking models that achieve state-of-the-art effectiveness on standard TREC newswire and social media test collections. This demonstration focuses on technical challenges in the integration of NLP and IR capabilities, along with the design rationale behind our approach to tightly-coupled integration between Python (to support neural networks) and the Java Virtual Machine (to support document retrieval using the open-source Lucene search library). We demonstrate integration of Birch with an existing search interface as well as interactive notebooks that highlight its capabilities in an easy-to-understand manner.

107 citations
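
For orientation, here is a hedged sketch of the retrieve-then-rerank flow Birch demonstrates, written against the modern Pyserini API that grew out of this Anserini integration; the prebuilt index name and the reranker stub are assumptions, not Birch's actual interface:

```python
# Lucene/JVM side: candidate generation with Anserini via Pyserini.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('robust04')  # assumed index
hits = searcher.search('airbus subsidies', k=100)

# Python side: a fine-tuned BERT model would rescore each candidate's
# sentences; this stub marks where that reranker plugs in.
def bert_rerank(query, docid):
    raise NotImplementedError  # stand-in for the neural reranker

for hit in hits[:10]:
    print(hit.docid, round(hit.score, 2))  # candidates awaiting reranking
```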


Proceedings ArticleDOI
30 Jan 2019
TL;DR: A Semantic-aware Heterogeneous Network Embedding model (SHNE) is developed that performs joint optimization of heterogeneous SkipGram and deep semantic encoding, capturing, as a function of node content, both the heterogeneous structural closeness and the unstructured semantic relations among all nodes in the network.
Abstract: Representation learning in heterogeneous networks faces challenges due to heterogeneous structural information of multiple types of nodes and relations, and also due to the unstructured attribute or content (e.g., text) associated with some types of nodes. While many recent works have studied homogeneous, heterogeneous, and attributed network embedding, there are few works that have collectively solved these challenges in heterogeneous networks. In this paper, we address them by developing a Semantic-aware Heterogeneous Network Embedding model (SHNE). SHNE performs joint optimization of heterogeneous SkipGram and deep semantic encoding, capturing, as a function of node content, both the heterogeneous structural closeness and the unstructured semantic relations among all nodes in the network. Extensive experiments demonstrate that SHNE outperforms state-of-the-art baselines in various heterogeneous network mining tasks, such as link prediction, document retrieval, node recommendation, relevance search, and class visualization.

83 citations
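
A toy sketch of the joint objective: skip-gram with negative sampling over heterogeneous walks, where content-bearing nodes are embedded by a deep encoder of their text rather than a lookup table. Dimensions and the encoder choice are illustrative, not SHNE's exact design:

```python
import torch
import torch.nn as nn

class SHNESketch(nn.Module):
    def __init__(self, n_nodes, dim=128, vocab=5000):
        super().__init__()
        self.lookup = nn.Embedding(n_nodes, dim)       # structure-only nodes
        self.word_emb = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # content encoder

    def embed(self, node_ids, content_tokens=None):
        if content_tokens is None:
            return self.lookup(node_ids)               # plain embedding
        _, h = self.encoder(self.word_emb(content_tokens))
        return h.squeeze(0)                            # semantic embedding

    def skipgram_loss(self, center, context, negatives):
        # center/context: (B, d); negatives: (B, K, d)
        pos = torch.sigmoid((center * context).sum(-1)).clamp_min(1e-7).log()
        neg = torch.sigmoid(-(center.unsqueeze(1) * negatives).sum(-1))
        return -(pos + neg.clamp_min(1e-7).log().sum(1)).mean()
```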


Journal ArticleDOI
TL;DR: This survey paper presents a critical study of different document layout analysis techniques and discusses comprehensively the different phases of the DLA algorithms based on a general framework that is formed as an outcome of reviewing the research in the field.
Abstract: Document layout analysis (DLA) is a preprocessing step of document understanding systems. It is responsible for detecting and annotating the physical structure of documents. DLA has several important applications such as document retrieval, content categorization, text recognition, and the like. The objective of DLA is to ease the subsequent analysis/recognition phases by identifying the document-homogeneous blocks and by determining their relationships. The DLA pipeline consists of several phases that could vary among DLA methods, depending on the documents’ layouts and final analysis objectives. In this regard, a universal DLA algorithm that fits all types of document layouts or that satisfies all analysis objectives has not yet been developed. In this survey paper, we present a critical study of different document layout analysis techniques. The study highlights the motivational reasons for pursuing DLA and discusses comprehensively the different phases of the DLA algorithms based on a general framework that is formed as an outcome of reviewing the research in the field. The DLA framework consists of preprocessing, layout analysis strategies, post-processing, and performance evaluation phases. Overall, the article delivers an essential baseline for pursuing further research in document layout analysis.

76 citations


Posted Content
TL;DR: To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.
Abstract: This paper presents the formal release of MedMentions, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.

56 citations
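
The corpus is distributed in a PubTator-style format (title and abstract lines, then tab-separated mention annotations). A minimal reader under that assumption; the field order below should be checked against the actual release:

```python
# Read PubTator-style lines: "PMID|t|Title", "PMID|a|Abstract", then
# "PMID <TAB> start <TAB> end <TAB> mention <TAB> semantic type <TAB> CUI".

def read_pubtator(path):
    docs, mentions = {}, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            if "|t|" in line or "|a|" in line:
                pmid, field, text = line.split("|", 2)
                docs.setdefault(pmid, {})[field] = text
            else:
                pmid, start, end, text, sem_type, cui = line.split("\t")
                mentions.append((pmid, int(start), int(end), text, sem_type, cui))
    return docs, mentions
```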


Journal ArticleDOI
TL;DR: The proposed approach can significantly improve the capability of defending against privacy breaches, as well as the scalability and time efficiency of query processing, compared with state-of-the-art methods.
Abstract: Cloud computing provides individuals and enterprises massive computing power and scalable storage capacities to support a variety of big data applications in domains like health care and scientific research, so more and more data owners outsource their data to cloud servers for convenience in data management and mining. However, data sets like health records in electronic documents usually contain sensitive information, which brings about privacy concerns if the documents are released or shared with partially untrusted third parties in the cloud. A practical and widely used technique for data privacy preservation is to encrypt data before outsourcing to the cloud servers, which however reduces the data utility and makes many traditional data analytic operators like keyword-based top-$k$ document retrieval obsolete. In this paper, we investigate the multi-keyword top-$k$ search problem for big data encryption against privacy breaches, and attempt to identify an efficient and secure solution to this problem. Specifically, for the privacy concern of query data, we construct a special tree-based index structure and design a random traversal algorithm, which makes even the same query produce different visiting paths on the index while keeping query accuracy unchanged under stronger privacy. For improving query efficiency, we propose a group multi-keyword top-$k$ search scheme based on the idea of partition, where a group of tree-based indexes are constructed for all documents. Finally, we combine these methods together into an efficient and secure approach to address our proposed top-$k$ similarity search. Extensive experimental results on real-life data sets demonstrate that our proposed approach can significantly improve the capability of defending against privacy breaches, as well as the scalability and time efficiency of query processing, compared with state-of-the-art methods.
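
Setting the cryptography aside, the primitive being protected is an ordinary multi-keyword top-k search. A minimal plaintext sketch with a bounded heap, for orientation only:

```python
import heapq

def topk_search(keywords, docs, k=2):
    heap = []  # (score, doc_id); the smallest retained score sits at the root
    for doc_id, text in docs.items():
        terms = text.lower().split()
        score = sum(terms.count(w) for w in keywords)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

docs = {"d1": "health records cloud", "d2": "cloud cloud privacy", "d3": "kernel"}
print(topk_search(["cloud", "privacy"], docs))  # [(3, 'd2'), (1, 'd1')]
```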

Proceedings Article
12 Mar 2019
TL;DR: The MedMentions corpus presented in this paper is a manually annotated resource for the recognition of biomedical concepts, which includes over 4,000 abstracts and over 350,000 linked mentions.
Abstract: This paper presents the formal release of MedMentions, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.

Proceedings ArticleDOI
10 Jul 2019
TL;DR: Retrieval Question Answering (ReQA), a benchmark for evaluating large-scale sentence- and paragraph-level answer retrieval models, is introduced and baselines are established using both neural encoding models as well as classical information retrieval techniques.
Abstract: Popular QA benchmarks like SQuAD have driven progress on the task of identifying answer spans within a specific passage, with models now surpassing human performance. However, retrieving relevant answers from a huge corpus of documents is still a challenging problem, and places different requirements on the model architecture. There is growing interest in developing scalable answer retrieval models trained end-to-end, bypassing the typical document retrieval step. In this paper, we introduce Retrieval Question-Answering (ReQA), a benchmark for evaluating large-scale sentence-level answer retrieval models. We establish baselines using both neural encoding models as well as classical information retrieval techniques. We release our evaluation code to encourage further work on this challenging task.
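
A minimal dual-encoder baseline of the kind the benchmark evaluates: embed questions and candidate answers in a shared space and retrieve by inner product. The toy word-vector encoder below is a placeholder for any real sentence encoder:

```python
import re
import numpy as np

def word_vector(w, dim=64):
    # Deterministic (within one process) random vector per word; a toy
    # stand-in for a trained encoder's word representations.
    rng = np.random.default_rng(abs(hash(w)) % (2**32))
    return rng.standard_normal(dim)

def encode(text, dim=64):
    vecs = [word_vector(w, dim) for w in re.findall(r"\w+", text.lower())]
    v = np.sum(vecs, axis=0)
    return v / (np.linalg.norm(v) or 1.0)

answers = ["The capital of France is Paris.", "BERT has twelve layers."]
index = np.stack([encode(a) for a in answers])   # precomputed answer side
query = encode("What is the capital of France?")
print(answers[int(np.argmax(index @ query))])
```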

Posted Content
TL;DR: A new substantially sized mixed-domain corpus with annotations of good quality for the core fact-checking tasks: document retrieval, evidence extraction, stance detection, and claim validation is presented.
Abstract: Automated fact-checking based on machine learning is a promising approach to identify false information distributed on the web. In order to achieve satisfactory performance, machine learning methods require a large corpus with reliable annotations for the different tasks in the fact-checking process. Having analyzed existing fact-checking corpora, we found that none of them meets these criteria in full. They are either too small in size, do not provide detailed annotations, or are limited to a single domain. Motivated by this gap, we present a new substantially sized mixed-domain corpus with annotations of good quality for the core fact-checking tasks: document retrieval, evidence extraction, stance detection, and claim validation. To aid future corpus construction, we describe our methodology for corpus creation and annotation, and demonstrate that it results in substantial inter-annotator agreement. As baselines for future research, we perform experiments on our corpus with a number of model architectures that reach high performance in similar problem settings. Finally, to support the development of future models, we provide a detailed error analysis for each of the tasks. Our results show that the realistic, multi-domain setting defined by our data poses new challenges for the existing models, providing opportunities for considerable improvement by future systems.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: This paper contains the system description for the second Fact Extraction and VERification (FEVER) challenge and proposes a two-staged sentence selection strategy to account for examples in the dataset where evidence is not only conditioned on the claim, but also on previously retrieved evidence.
Abstract: This paper contains our system description for the second Fact Extraction and VERification (FEVER) challenge. We propose a two-staged sentence selection strategy to account for examples in the dataset where evidence is not only conditioned on the claim, but also on previously retrieved evidence. We use a publicly available document retrieval module and have fine-tuned BERT checkpoints for sentence selection and as the entailment classifier. We report a FEVER score of 68.46% on the blind test set.
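
An illustrative rendering of the two-staged idea: score sentences against the claim alone, then rescore the remainder conditioned on the claim plus the evidence already retrieved. The overlap scorer is a toy stand-in for the fine-tuned BERT models:

```python
def score(premise, hypothesis):
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / (len(h) or 1)

def select_evidence(claim, sentences, k1=1, k2=1):
    stage1 = sorted(sentences, key=lambda s: score(s, claim), reverse=True)[:k1]
    rest = [s for s in sentences if s not in stage1]
    context = claim + " " + " ".join(stage1)   # condition on prior evidence
    stage2 = sorted(rest, key=lambda s: score(s, context), reverse=True)[:k2]
    return stage1 + stage2

claim = "Paris is the capital of France"
sents = ["Paris is the capital and largest city of France.",
         "France is a country in Europe.",
         "The city has a population of about two million."]
print(select_evidence(claim, sents))
```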

Proceedings ArticleDOI
29 Aug 2019
TL;DR: This paper presented a new substantially sized mixed-domain corpus with annotations of good quality for the core fact-checking tasks: document retrieval, evidence extraction, stance detection, and claim validation.
Abstract: Automated fact-checking based on machine learning is a promising approach to identify false information distributed on the web. In order to achieve satisfactory performance, machine learning methods require a large corpus with reliable annotations for the different tasks in the fact-checking process. Having analyzed existing fact-checking corpora, we found that none of them meets these criteria in full. They are either too small in size, do not provide detailed annotations, or are limited to a single domain. Motivated by this gap, we present a new substantially sized mixed-domain corpus with annotations of good quality for the core fact-checking tasks: document retrieval, evidence extraction, stance detection, and claim validation. To aid future corpus construction, we describe our methodology for corpus creation and annotation, and demonstrate that it results in substantial inter-annotator agreement. As baselines for future research, we perform experiments on our corpus with a number of model architectures that reach high performance in similar problem settings. Finally, to support the development of future models, we provide a detailed error analysis for each of the tasks. Our results show that the realistic, multi-domain setting defined by our data poses new challenges for the existing models, providing opportunities for considerable improvement by future systems.

Journal ArticleDOI
TL;DR: It is clarified that the combination of graph-based algorithms and co-citation contexts is effective in improving the performance of co-citation search techniques, and that sole use of a graph-based algorithm is not enough to improve search performance over the baselines.
Abstract: This study proposes a novel extended co-citation search technique, which is graph-based document retrieval on a co-citation network containing citation context information. The proposed search expands the scope of the target documents by repetitively spreading the relationship of co-citation in order to obtain relevant documents that are not identified by traditional co-citation searches. Specifically, this search technique is a combination of (a) applying a graph-based algorithm to compute the similarity score on a complicated network, and (b) incorporating co-citation contexts into the process of calculating similarity scores to reduce the negative effects of an increasing number of irrelevant documents. To evaluate the search performance of the proposed search, 10 proposed methods (five representative graph-based algorithms applied to co-citation networks weighted with/without contexts) are compared with two kinds of baselines (a traditional co-citation search with/without contexts) in information retrieval experiments based on two test collections (biomedical and computational linguistics articles). The experimental results showed that the normalized discounted cumulative gain (nDCG@K) scores of the proposed methods using co-citation contexts tended to be higher than those of the baselines. In addition, the combination of the random walk with restart (RWR) algorithm and the network weighted with contexts achieved the best search performance among the 10 proposed methods. Thus, it is clarified that the combination of graph-based algorithms and co-citation contexts is effective in improving the performance of co-citation search techniques, and that sole use of a graph-based algorithm is not enough to improve search performance over the baselines.
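
The best-performing configuration pairs random walk with restart (RWR) with a context-weighted network. A compact RWR sketch on a toy weighted co-citation matrix (the weights are invented; in the paper they derive from co-citation counts and citation contexts):

```python
import numpy as np

def rwr(W, seed, restart=0.15, iters=100):
    # Column-normalize the weighted adjacency into transition probabilities.
    P = W / np.clip(W.sum(axis=0, keepdims=True), 1e-12, None)
    r = np.zeros(len(W)); r[seed] = 1.0
    e = r.copy()
    for _ in range(iters):
        r = (1 - restart) * P @ r + restart * e
    return r  # stationary relevance of every document to the seed

W = np.array([[0, 3, 1, 0],
              [3, 0, 2, 0],
              [1, 2, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(rwr(W, seed=0).round(3))
```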

Journal ArticleDOI
TL;DR: Fuzzy logic is also employed to improve the performance of accelerated particle swarm optimization by controlling various parameters; the proposed approach gets better results in comparison to other automatic query expansion approaches.
Abstract: Nowadays, searching for relevant documents in a large dataset becomes a big challenge. Automatic query expansion is one of the techniques that addresses this problem by refining the query. A new query expansion approach using cuckoo search and accelerated particle swarm optimization is proposed in this paper. The proposed approach mainly focuses on finding the most relevant expanded query rather than suitable expansion terms. Fuzzy logic is also employed, which improves the performance of accelerated particle swarm optimization by controlling various parameters. We have compared the proposed approach with other existing and recently developed automatic query expansion approaches on various evaluation parameters such as average recall, average precision, Mean Average Precision, F-measure, and the precision-recall graph. We have evaluated the performance of all approaches on three datasets: CISI, CACM, and TREC-3. The results obtained for all three datasets show that the proposed approach gets better results in comparison to other automatic query expansion approaches.
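
To make the optimization concrete, here is a toy accelerated-PSO loop in the spirit of the approach: each particle is a weight vector over candidate expansion terms, and the fitness function is a placeholder for a real retrieval-quality measure such as MAP on training queries. The constants and the fitness are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
candidates = ["retrieval", "indexing", "ranking", "search"]

def fitness(weights):
    # Placeholder: reward concentrating weight on a few strong terms.
    return weights.max() - 0.1 * (weights > 0.2).sum()

swarm = rng.random((10, len(candidates)))
best = max(swarm, key=fitness).copy()
for _ in range(50):   # accelerated PSO: drift toward the global best + noise
    swarm = 0.7 * swarm + 0.3 * best + 0.1 * rng.standard_normal(swarm.shape)
    swarm = swarm.clip(0, 1)
    top = max(swarm, key=fitness)
    if fitness(top) > fitness(best):
        best = top.copy()
print({term: round(float(w), 2) for term, w in zip(candidates, best)})
```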

Posted Content
TL;DR: FAKTA, presented in this paper, is a unified framework that integrates various components of a fact checking process: document retrieval from media sources with various types of reliability, stance detection of documents with respect to given claims, evidence extraction, and linguistic analysis.
Abstract: We present FAKTA, a unified framework that integrates various components of a fact checking process: document retrieval from media sources with various types of reliability, stance detection of documents with respect to given claims, evidence extraction, and linguistic analysis. FAKTA predicts the factuality of given claims and provides evidence at the document and sentence level to explain its predictions.

Book ChapterDOI
14 Oct 2019
TL;DR: In this paper, load centrality, a graph-theoretic measure applied to graphs derived from a given text, is used to efficiently identify and rank keywords, and meta vertices (aggregates of existing vertices) and systematic redundancy filters are introduced.
Abstract: Keyword extraction is used for summarizing the content of a document and supports efficient document retrieval, and is as such an indispensable part of modern text-based systems. We explore how load centrality, a graph-theoretic measure applied to graphs derived from a given text, can be used to efficiently identify and rank keywords. Introducing meta vertices (aggregates of existing vertices) and systematic redundancy filters, the proposed method performs on par with the state of the art for the keyword extraction task on 14 diverse datasets. The proposed method is unsupervised, interpretable and can also be used for document visualization.
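
A bare-bones version of the core idea, before the meta vertices and redundancy filters are added: build a word co-occurrence graph over a sliding window and rank words by load centrality, which networkx implements directly:

```python
import itertools
import re
import networkx as nx

def keywords(text, window=2, k=5):
    tokens = re.findall(r"[a-z]+", text.lower())
    g = nx.Graph()
    for i in range(len(tokens) - window + 1):
        for u, v in itertools.combinations(tokens[i:i + window], 2):
            if u != v:
                g.add_edge(u, v)
    ranks = nx.load_centrality(g)   # graph-theoretic keyword scores
    return sorted(ranks, key=ranks.get, reverse=True)[:k]

print(keywords("document retrieval systems rank documents; retrieval "
               "quality depends on ranking and on document representations"))
```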

Proceedings ArticleDOI
01 Jun 2019
TL;DR: FAKTA predicts the factuality of given claims and provides evidence at the document and sentence level to explain its predictions, and FAKTA integrates various components of a fact-checking process.
Abstract: We present FAKTA, a unified framework that integrates various components of a fact-checking process: document retrieval from media sources with various types of reliability, stance detection of documents with respect to given claims, evidence extraction, and linguistic analysis. FAKTA predicts the factuality of given claims and provides evidence at the document and sentence level to explain its predictions.

Book ChapterDOI
16 Sep 2019
TL;DR: The submissions of AUEB to the BioASQ 7 document and snippet retrieval tasks (parts of Task 7b, Phase A) are presented, including models that jointly learn to retrieve documents and snippets.
Abstract: We present the submissions of AUEB to the BioASQ 7 document and snippet retrieval tasks (parts of Task 7b, Phase A). Our systems build upon the methods we used in BioASQ 6. This year we also experimented with models that jointly learn to retrieve documents and snippets, as opposed to using separate pipelined models for document and snippet retrieval. We also experimented with models based on BERT [5]. Our systems obtained the best document and snippet retrieval results for all batches of the challenge that we participated in.

Proceedings ArticleDOI
TL;DR: The authors propose Retrieval Question Answering (ReQA), a benchmark for evaluating large-scale sentence- and paragraph-level answer retrieval models, and establish baselines using both neural encoding models and classical information retrieval techniques.
Abstract: Popular QA benchmarks like SQuAD have driven progress on the task of identifying answer spans within a specific passage, with models now surpassing human performance. However, retrieving relevant answers from a huge corpus of documents is still a challenging problem, and places different requirements on the model architecture. There is growing interest in developing scalable answer retrieval models trained end-to-end, bypassing the typical document retrieval step. In this paper, we introduce Retrieval Question Answering (ReQA), a benchmark for evaluating large-scale sentence- and paragraph-level answer retrieval models. We establish baselines using both neural encoding models as well as classical information retrieval techniques. We release our evaluation code to encourage further work on this challenging task.

Proceedings ArticleDOI
18 Jul 2019
TL;DR: This workshop tackles the replicability challenge for ad hoc document retrieval, via a common Docker interface specification to support images that capture systems performing ad hoc retrieval experiments on standard test collections.
Abstract: The importance of repeatability, replicability, and reproducibility is broadly recognized in the computational sciences, both in supporting desirable scientific methodology as well as sustaining empirical progress. This workshop tackles the replicability challenge for ad hoc document retrieval, via a common Docker interface specification to support images that capture systems performing ad hoc retrieval experiments on standard test collections.

Proceedings ArticleDOI
08 Mar 2019
TL;DR: This paper used Amazon Mechanical Turk to investigate three answer presentation and interaction approaches in a non-factoid question answering setting and found that people perceive and react to good and bad answers very differently and can identify good answers relatively quickly.
Abstract: Information retrieval systems are evolving from document retrieval to answer retrieval. Web search logs provide large amounts of data about how people interact with ranked lists of documents, but very little is known about interaction with answer texts. In this paper, we use Amazon Mechanical Turk to investigate three answer presentation and interaction approaches in a non-factoid question answering setting. We find that people perceive and react to good and bad answers very differently, and can identify good answers relatively quickly. Our results provide the basis for further investigation of effective answer interaction and feedback methods.

Book ChapterDOI
TL;DR: This work explores how load centrality, a graph-theoretic measure applied to graphs derived from a given text, can be used to efficiently identify and rank keywords.
Abstract: Keyword extraction is used for summarizing the content of a document and supports efficient document retrieval, and is as such an indispensable part of modern text-based systems. We explore how load centrality, a graph-theoretic measure applied to graphs derived from a given text, can be used to efficiently identify and rank keywords. Introducing meta vertices (aggregates of existing vertices) and systematic redundancy filters, the proposed method performs on par with the state of the art for the keyword extraction task on 14 diverse datasets. The proposed method is unsupervised, interpretable and can also be used for document visualization.

Journal ArticleDOI
TL;DR: This paper proposes a new expansion method for medical text (query/document) based on retro-semantic mapping between textual terms and UMLS concepts relevant to medical image retrieval; the method significantly improves retrieval accuracy and outperforms the approaches offered in the literature.

Journal ArticleDOI
TL;DR: A domain ontology model with document processing and document retrieval is proposed, and the feasibility and superiority of the domain ontology model are demonstrated experimentally.
Abstract: An information retrieval system not only occupies an important position in the network information platform, but also plays an important role in information acquisition, query processing, and wireless sensor networks. Document retrieval tools help researchers extract relevant documents from data sets. Classic keyword-based information retrieval models neglect semantic information and thus cannot represent the user's needs. Therefore, how to efficiently acquire the personalized information that users need is of concern. Ontology-based systems lack an expert list to obtain accurate index term frequency. In this paper, a domain ontology model with document processing and document retrieval is proposed, and the feasibility and superiority of the domain ontology model are demonstrated experimentally.

Posted Content
TL;DR: The model outperforms the competitive translation-based baselines on English-Swahili, English-Tagalog, and English-Somali cross-lingual information retrieval tasks and can also be directly applied to another language pair without any training label.
Abstract: In this paper, we propose to boost low-resource cross-lingual document retrieval performance with deep bilingual query-document representations. We match queries and documents in both source and target languages with four components, each of which is implemented as a term interaction-based deep neural network with cross-lingual word embeddings as input. By including query likelihood scores as extra features, our model effectively learns to rerank the retrieved documents by using a small number of relevance labels for low-resource language pairs. Due to the shared cross-lingual word embedding space, the model can also be directly applied to another language pair without any training label. Experimental results on the MATERIAL dataset show that our model outperforms the competitive translation-based baselines on English-Swahili, English-Tagalog, and English-Somali cross-lingual information retrieval tasks.
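
A minimal illustration of the shared-space property the abstract relies on: once word vectors are aligned into one cross-lingual space, a query in one language can be scored directly against documents in another. The vectors below are invented; the paper learns term-interaction networks over such inputs:

```python
import numpy as np

xemb = {  # toy aligned English/Swahili vectors living in the same space
    "water": np.array([0.90, 0.10]), "maji": np.array([0.88, 0.12]),
    "food": np.array([0.10, 0.90]), "chakula": np.array([0.11, 0.92]),
}

def score(query_terms, doc_terms):
    q = np.mean([xemb[w] for w in query_terms if w in xemb], axis=0)
    d = np.mean([xemb[w] for w in doc_terms if w in xemb], axis=0)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

print(score(["water"], ["maji"]))     # high: translation pair
print(score(["water"], ["chakula"]))  # low: unrelated concept
```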

Proceedings ArticleDOI
01 Oct 2019
TL;DR: This paper presents TopicSifter, a visual analytics system for interactive search space reduction that utilizes targeted topic modeling based on nonnegative matrix factorization and allows users to give relevance feedback in order to refine their target and guide the topic modeling to the most relevant results.
Abstract: Topic modeling is commonly used to analyze and understand large document collections. However, in practice, users want to focus on specific aspects or "targets" rather than the entire corpus. For example, given a large collection of documents, users may want only a smaller subset which more closely aligns with their interests, tasks, and domains. In particular, our paper focuses on large-scale document retrieval with high recall, where any missed relevant documents can be critical. A simple keyword-matching search is generally neither effective nor efficient, as 1) it is difficult to find a list of keyword queries that can cover the documents of interest before exploring the dataset, 2) some documents may not contain the exact keywords of interest but may still be highly relevant, and 3) some words have multiple meanings, which would result in irrelevant documents being included in the retrieved subset. In this paper, we present TopicSifter, a visual analytics system for interactive search space reduction. Our system utilizes targeted topic modeling based on nonnegative matrix factorization and allows users to give relevance feedback in order to refine their target and guide the topic modeling to the most relevant results.
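
A sketch of the targeted-retrieval loop such a system supports: fit NMF over tf-idf vectors, pick the topic closest to the user's target keywords, and keep the documents that load on it; the interactive part would update the keywords and refit. The corpus, threshold, and component count are illustrative:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["deep learning for retrieval", "retrieval of court documents",
        "gardening tips for spring", "neural ranking models for search"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

model = NMF(n_components=2, init="nndsvda", random_state=0)
W, H = model.fit_transform(X), model.components_   # doc and topic factors

target = vec.transform(["retrieval search ranking"]).toarray()[0]
topic = int((H @ target).argmax())        # topic best matching the target
print([d for d, w in zip(docs, W[:, topic]) if w > 0.05])
```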