Showing papers on "Document retrieval" published in 2021


Proceedings ArticleDOI
01 Jun 2021
TL;DR: This work constructs a large-scale dataset built on 40K information-seeking questions across 7 diverse non-English languages that TyDi QA could not find same-language answers for, and introduces a task framework, called Cross-lingual Open-Retrieval Question Answering (XOR QA), that consists of three new tasks involving cross-lingual document retrieval from multilingual and English resources.
Abstract: Multilingual question answering tasks typically assume that answers exist in the same language as the question. Yet in practice, many languages face both information scarcity—where languages have few reference articles—and information asymmetry—where questions reference concepts from other cultures. This work extends open-retrieval question answering to a cross-lingual setting enabling questions from one language to be answered via answer content from another language. We construct a large-scale dataset built on 40K information-seeking questions across 7 diverse non-English languages that TyDi QA could not find same-language answers for. Based on this dataset, we introduce a task framework, called Cross-lingual Open-Retrieval Question Answering (XOR QA), that consists of three new tasks involving cross-lingual document retrieval from multilingual and English resources. We establish baselines with state-of-the-art machine translation systems and cross-lingual pretrained models. Experimental results suggest that XOR QA is a challenging task that will facilitate the development of novel techniques for multilingual question answering. Our data and code are available at https://nlp.cs.washington.edu/xorqa/.

82 citations


Proceedings Article
03 May 2021
TL;DR: GENRE retrieves entities by generating their unique names left to right, token by token, in an autoregressive fashion conditioned on the context.
Abstract: Entities are at the center of how we represent and aggregate knowledge. For instance, encyclopedias such as Wikipedia are structured by entities (e.g., one per Wikipedia article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. One way to understand current approaches is as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity meta information such as their descriptions. This approach leads to several shortcomings: (i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions between the two; (ii) a large memory footprint is needed to store dense representations when considering large entity sets; (iii) an appropriately hard set of negative data has to be subsampled at training time. In this work, we propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion and conditioned on the context. This enables us to mitigate the aforementioned technical issues since: (i) the autoregressive formulation allows us to directly capture relations between context and entity name, effectively cross encoding both; (ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; (iii) the exact softmax loss can be efficiently computed without the need to subsample negative data. We show the efficacy of the approach, experimenting with more than 20 datasets on entity disambiguation, end-to-end entity linking and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory footprint of competing systems. Finally, we demonstrate that new entities can be added by simply specifying their unambiguous name. Code and pre-trained models at https://github.com/facebookresearch/GENRE.

47 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigate the performance of 56 different methodologies for computing textual similarity across court case statements when applied on a dataset of Indian Supreme Court Cases and propose five of the best performing methods as appropriate for measuring similarity between case reports.
Abstract: In the domain of legal information retrieval, an important challenge is to compute similarity between two legal documents. Precedents (statements from prior cases) play an important role in the common law system, where lawyers need to frequently refer to relevant prior cases. Measuring document similarity is one of the most crucial aspects of any document retrieval system, as it determines the speed, scalability and accuracy of the system. Text-based and network-based methods for computing similarity among case reports have already been proposed in prior works but not without a few pitfalls. Since legal citation networks are generally highly disconnected, network-based metrics are not suited for them. To date, only a few text-based and predominant embedding-based methods have been employed, for instance, TF-IDF based approaches, Word2Vec (Mikolov et al. 2013) and Doc2Vec (Le and Mikolov 2014) based approaches. We investigate the performance of 56 different methodologies for computing textual similarity across court case statements when applied on a dataset of Indian Supreme Court Cases. Among the 56 different methods, thirty are adaptations of existing methods and twenty-six are our proposed methods. The methods studied include models such as BERT (Devlin et al. 2018) and Law2Vec (Ilias 2019). It is observed that the more traditional methods (such as TF-IDF and LDA) that rely on a bag-of-words representation perform better than the more advanced context-aware methods (like BERT and Law2Vec) for computing document-level similarity. Finally, we nominate, via empirical validation, five of our best performing methods as appropriate for measuring similarity between case reports. Among these five, two are adaptations of existing methods and the other three are our proposed methods.

20 citations
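
The bag-of-words similarity that the abstract above reports as a strong baseline can be illustrated with a minimal TF-IDF and cosine similarity sketch. This is not the authors' implementation: the function names, the smoothed IDF variant, and the toy case snippets are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an illustrative choice) keeps every weight positive.
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy, made-up case snippets (not taken from the paper's dataset).
cases = [
    "appeal dismissed by the high court".split(),
    "the high court dismissed the appeal".split(),
    "contract for sale of land declared void".split(),
]
vecs = tfidf_vectors(cases)
# The two snippets about a dismissed appeal should be closer to each other
# than either is to the unrelated contract snippet.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```

Because word order is ignored, the first two snippets score as similar even though they are phrased differently, which is the property a bag-of-words representation trades context for.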


Proceedings ArticleDOI
11 Jul 2021
TL;DR: This paper proposes Contrastive Dual Learning for Approximate Nearest Neighbor (DANCE) to learn fine-grained query representations for dense retrieval, inspired by the classic information retrieval training axiom, query likelihood.
Abstract: Dense retrieval conducts text retrieval in the embedding space and has shown many advantages compared to sparse retrieval. Existing dense retrievers optimize representations of queries and documents with contrastive training and map them to the embedding space. The embedding space is optimized by aligning the matched query-document pairs and pushing the negative documents away from the query. However, in such a training paradigm, the queries are only optimized to align to the documents and are coarsely positioned, leading to an anisotropic query embedding space. In this paper, we analyze the embedding space distributions and propose an effective training paradigm, Contrastive Dual Learning for Approximate Nearest Neighbor (DANCE), to learn fine-grained query representations for dense retrieval. DANCE incorporates an additional dual training objective of query retrieval, inspired by the classic information retrieval training axiom, query likelihood. With contrastive learning, the dual training objective of DANCE learns more tailored representations for queries and documents to keep the embedding space smooth and uniform, improving the ranking performance of DANCE on the MS MARCO document retrieval task. Unlike ANCE, which is optimized only with the document retrieval task, DANCE concentrates the query embeddings closer to document representations while making the document distribution more discriminative. Such a concentrated query embedding distribution assigns more uniform negative sampling probabilities to queries and helps to sufficiently optimize query representations in the query retrieval task. Our code is released at https://github.com/thunlp/DANCE.

15 citations
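
The idea of pairing a document-retrieval contrastive loss with a dual query-retrieval loss, as the DANCE abstract above describes, can be sketched in a few lines. This is only a toy illustration on hand-written vectors, not the released DANCE code: the dot-product scorer, the function names, and the `weight` parameter that mixes the two losses are assumptions.

```python
import math


def dot(u, v):
    """Dot-product relevance score between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(anchor, positive, negatives):
    """Negative log-softmax of the positive's score among positive + negatives."""
    scores = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    top = max(scores)  # subtract the max for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum_exp - scores[0]


def dual_contrastive_loss(query, pos_doc, neg_docs, neg_queries, weight=1.0):
    """Document retrieval loss plus a weighted query retrieval (dual) loss."""
    # Document retrieval: the query should pick its document among negatives.
    doc_retrieval = contrastive_loss(query, pos_doc, neg_docs)
    # Dual task: the document should pick its query among negative queries.
    query_retrieval = contrastive_loss(pos_doc, query, neg_queries)
    return doc_retrieval + weight * query_retrieval


# Toy 2-d embeddings (made up for illustration).
query, pos_doc = [1.0, 0.0], [1.0, 0.0]
neg_docs, neg_queries = [[0.0, 1.0]], [[0.0, 1.0]]
print(dual_contrastive_loss(query, pos_doc, neg_docs, neg_queries))
```

Setting `weight=0.0` recovers a document-retrieval-only objective, which makes it easy to compare the two training signals on the same toy inputs.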


Posted Content
TL;DR: The second year of the TREC Deep Learning Track again studies ad hoc ranking in the large training data regime, with a document retrieval task and a passage retrieval task, and finds further evidence that rankers with BERT-style pretraining outperform other rankers in this regime.
Abstract: This is the second year of the TREC Deep Learning Track, with the goal of studying ad hoc ranking in the large training data regime. We again have a document retrieval task and a passage retrieval task, each with hundreds of thousands of human-labeled training queries. We evaluate using single-shot TREC-style evaluation, to give us a picture of which ranking methods work best when large data is available, with much more comprehensive relevance labeling on the small number of test queries. This year we have further evidence that rankers with BERT-style pretraining outperform other rankers in the large data regime.

14 citations


Journal ArticleDOI
TL;DR: In this article, a retrieval method for scientific documents based on HFS (Hesitation Fuzzy Sets) and BERT (Bidirectional Encoder Representations from Transformers) is proposed.
Abstract: When retrieving scientific documents with mathematical expressions as the main content, both mathematical expressions and their contextual text features require consideration. However, mathematical expressions are different from texts in terms of grammar and semantics. Thus, integrating the above features and realizing scientific document retrieval is difficult. In this study, a retrieval method for scientific documents based on HFS (Hesitation Fuzzy Sets) and BERT (Bidirectional Encoder Representations from Transformers) is proposed. This method is realized through utilizing the advantages of HFS in multi-attribute decision making and BERT in context-dependent similarity calculation. By analyzing mathematical expressions and calculating the membership degree of symbolic multi-attributes, the similarity of mathematical expressions can be obtained, which can improve the accuracy of mathematical expression recall. With the extraction of the text of the expression context, BERT is used to calculate the context similarity. Then, the recalled technical documents are sorted according to the similarity of context, and the final retrieval result can be obtained. Experiments were carried out on 10,372 Chinese and 11,770 English scientific documents in the NTCIR extended data set. The average value of $\mathrm{MAP}_k$ ($k=10$) for the recall results of scientific documents was 74.13%. The average $n$DCG ($n=10$) for the ranking of scientific documents was 86.04%.

14 citations
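
The two metrics reported in the abstract above, MAP at a cutoff of 10 and nDCG at a cutoff of 10, can be sketched as follows. Several variants of both metrics exist, and the abstract does not say which one the authors used, so the normalization by min(number of relevant items, k) and the log2 discount below are assumptions. The document ids are made up.

```python
import math


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k: sum of precision at each relevant rank in the top k, normalized."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(relevant_ids), k)
    return precision_sum / denom if denom else 0.0


def ndcg_at_k(ranked_ids, gains, k=10):
    """nDCG@k with graded gains (dict: doc id -> gain) and a log2 rank discount."""
    def dcg(gain_list):
        return sum(g / math.log2(rank + 1)
                   for rank, g in enumerate(gain_list, start=1))

    actual = dcg([gains.get(d, 0.0) for d in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0


# One toy query: relevant documents appear at ranks 2 and 4.
ranking = ["d3", "d1", "d7", "d2"]
ap = average_precision_at_k(ranking, {"d1", "d2"})
ndcg = ndcg_at_k(ranking, {"d1": 1.0, "d2": 1.0})
print(ap, ndcg)
```

MAP is then the mean of `average_precision_at_k` over all queries in the test set.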


Proceedings ArticleDOI
11 Jul 2021
TL;DR: This paper proposes a unified framework for any-hop open-domain question answering that iteratively retrieves, reranks and filters documents and adaptively determines when to stop retrieval, with a graph-based reranking model that performs multi-document interaction at its core.
Abstract: Existing approaches for open-domain question answering (QA) are typically designed for questions that require either single-hop or multi-hop reasoning, which makes strong assumptions about the complexity of questions to be answered. Also, multi-step document retrieval often incurs a higher number of relevant but non-supporting documents, which dampens the downstream noise-sensitive reader module for answer extraction. To address these challenges, we propose a unified QA framework to answer any-hop open-domain questions, which iteratively retrieves, reranks and filters documents, and adaptively determines when to stop the retrieval process. To improve the retrieval accuracy, we propose a graph-based reranking model that performs multi-document interaction as the core of our iterative reranking framework. Our method consistently achieves performance comparable to or better than the state-of-the-art on both single-hop and multi-hop open-domain QA datasets, including Natural Questions Open, SQuAD Open, and HotpotQA.

14 citations


Proceedings ArticleDOI
11 Jul 2021
TL;DR: In this article, a Maximum-Marginal-Relevance (MMR) based BERT model was proposed to leverage negative feedback based on the MMR principle for the next clarifying question selection.
Abstract: Users often need to look through multiple search result pages or reformulate queries when they have complex information-seeking needs. Conversational search systems make it possible to improve user satisfaction by asking questions to clarify users' search intents. This, however, can take significant effort to answer a series of questions starting with "what/why/how". To quickly identify user intent and reduce effort during interactions, we propose an intent clarification task based on yes/no questions where the system needs to ask the correct question about intents within the fewest conversation turns. In this task, it is essential to use negative feedback about the previous questions in the conversation history. To this end, we propose a Maximum-Marginal-Relevance (MMR) based BERT model (MMR-BERT) to leverage negative feedback based on the MMR principle for the next clarifying question selection. Experiments on the Qulac dataset show that MMR-BERT outperforms state-of-the-art baselines significantly on the intent identification task and the selected questions also achieve significantly better performance in the associated document retrieval tasks.

13 citations
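
The Maximum-Marginal-Relevance principle that the abstract above builds on can be sketched as a greedy selection loop. This shows the classic MMR criterion only, not the MMR-BERT model: the relevance scores, the similarity table, and the `lam` trade-off value are made-up assumptions.

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: repeatedly pick the candidate that maximizes
    lam * relevance - (1 - lam) * (max similarity to anything already picked)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy clarifying questions: q1 and q2 are near-duplicates, q3 is different.
relevance = {"q1": 0.9, "q2": 0.85, "q3": 0.5}
sim_table = {
    frozenset(("q1", "q2")): 0.9,
    frozenset(("q1", "q3")): 0.1,
    frozenset(("q2", "q3")): 0.1,
}


def sim(a, b):
    """Symmetric lookup into the hypothetical similarity table."""
    return sim_table[frozenset((a, b))]


print(mmr_select(["q1", "q2", "q3"], relevance, sim, k=2, lam=0.5))
```

With `lam=1.0` the redundancy term vanishes and the loop degenerates to ranking by relevance alone, so the near-duplicate q2 is chosen second instead of q3.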


Book ChapterDOI
28 Mar 2021
TL;DR: The authors reproduce three passage score aggregation approaches proposed by Dai and Callan [5] for overcoming BERT's maximum input length limitation in document retrieval, and find that other BERT variants are not more effective for document retrieval in isolation but can lead to increased effectiveness when combined with pre-fine-tuning on the MS MARCO passage dataset.
Abstract: While BERT has been shown to be effective for passage retrieval, its maximum input length limitation poses a challenge when applying the model to document retrieval. In this work, we reproduce three passage score aggregation approaches proposed by Dai and Callan [5] for overcoming this limitation. After reproducing their results, we generalize their findings through experiments with a new dataset and experiment with other pretrained transformers that share similarities with BERT. We find that these BERT variants are not more effective for document retrieval in isolation, but can lead to increased effectiveness when combined with “pre–fine-tuning” on the MS MARCO passage dataset. Finally, we investigate whether there is a difference between fine-tuning models on “deep” judgments (i.e., fewer queries with many judgments each) vs. fine-tuning on “shallow” judgments (i.e., many queries with fewer judgments each). Based on available data from two different datasets, we find that the two approaches perform similarly.

13 citations
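
Passage score aggregation, the technique the abstract above reproduces, can be sketched as splitting a long document into overlapping passages, scoring each passage, and combining the scores. The abstract does not name the three approaches; taking the first passage's score, the maximum, or the sum is how they are commonly described, and is treated as an assumption here. The window sizes and the term-overlap scorer are hypothetical stand-ins for a BERT ranker.

```python
def split_into_passages(tokens, size=150, stride=75):
    """Overlapping windows so each passage fits a model's input limit."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size]
            for i in range(0, len(tokens) - size + stride, stride)]


def aggregate(passage_scores, method="max"):
    """Combine per-passage scores into one document score."""
    if not passage_scores:
        raise ValueError("need at least one passage score")
    if method == "first":
        return passage_scores[0]
    if method == "max":
        return max(passage_scores)
    if method == "sum":
        return sum(passage_scores)
    raise ValueError(f"unknown method: {method}")


def toy_score(query_terms, passage):
    """Hypothetical scorer: fraction of query terms found in the passage."""
    return sum(t in passage for t in query_terms) / len(query_terms)


# Tiny window sizes so the example is easy to follow by hand.
document = "a b c d e f g h".split()
passages = split_into_passages(document, size=4, stride=2)
scores = [toy_score(["e", "h"], p) for p in passages]
print({m: aggregate(scores, m) for m in ("first", "max", "sum")})
```

The three methods can rank the same document very differently: here the query terms only occur late in the document, so the first-passage score is zero while the maximum is not.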


Book ChapterDOI
01 Jan 2021
TL;DR: In this article, the authors use structural topic modeling along with pointwise mutual information and a recurrent neural network to acquire the best-associated document for a query, building on the idea that each word has a probability of belonging to a topic.
Abstract: As a result of the increase in Internet usage, the data on the Web has been expanding exponentially. Since most of the data is unorganized and scattered globally, it is challenging to acquire the best-associated document for a user query. This article uses structural topic modeling along with pointwise mutual information and a recurrent neural network to acquire the best-associated document for the query. The idea behind structural topic modeling is that each word has a probability of belonging to a topic. The topics belonging to each document are identified by using the pointwise mutual information of the words that it comprises. This makes it easy to discover the documents on the topics that the user asked for in the query. Here, recurrent neural networks are used to classify the query-related documents to yield better results. Also, pointwise mutual information has been applied for finding the similarity between words. This research used the RCV2 dataset for experimentation, where the normalized discounted cumulative gain, accuracy, F-measure, and false detection rate were compared for measuring the performance of the model. The experiment's results show that the proposed model performs better than the baseline variations and the baseline models.
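
Pointwise mutual information, which the abstract above uses to measure similarity between words, can be estimated from document-level co-occurrence counts. This is a minimal sketch and not the authors' pipeline: the estimation by document frequency, the base-2 logarithm, and the toy documents are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """PMI(x, y) = log2(p(x, y) / (p(x) * p(y))), with probabilities estimated
    as the fraction of documents containing the word (or both words)."""
    n = len(docs)
    word_count = Counter()
    pair_count = Counter()
    for doc in docs:
        vocab = sorted(set(doc))  # sorted so each pair has one canonical order
        word_count.update(vocab)
        pair_count.update(combinations(vocab, 2))
    return {
        (x, y): math.log2((c / n) / ((word_count[x] / n) * (word_count[y] / n)))
        for (x, y), c in pair_count.items()
    }


# Toy tokenized documents (made up for illustration).
docs = [
    ["stock", "market", "rises"],
    ["stock", "market", "falls"],
    ["football", "match", "tonight"],
    ["market", "match"],
]
pmi = pmi_table(docs)
# Words that co-occur more often than chance get a positive score.
print(pmi[("market", "stock")], pmi[("market", "match")])
```

Pairs that never co-occur are absent from the table; a caller has to decide how to treat them, for example as negative infinity or as zero.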

Proceedings ArticleDOI
Xingjiao Wu, Ziling Hu, Xiangcheng Du, Jing Yang, Liang He
05 Jul 2021
TL;DR: This paper proposes an end-to-end united network named Dynamic Residual Fusion Network (DRFN) for the DLA task, and designs a dynamic residual feature fusion module which can fully utilize low-dimensional information and maintain high-dimensional category information.
Abstract: Document layout analysis (DLA) aims to split the document image into different interest regions and understand the role of each region, which has wide applications such as optical character recognition (OCR) systems and document retrieval. However, it is a challenge to build a DLA system because the training data is very limited and efficient models are lacking. In this paper, we propose an end-to-end united network named Dynamic Residual Fusion Network (DRFN) for the DLA task. Specifically, we design a dynamic residual feature fusion module which can fully utilize low-dimensional information and maintain high-dimensional category information. Besides, to deal with the model overfitting problem that is caused by a lack of data, we propose the dynamic select mechanism for efficient fine-tuning with limited training data. We experiment with two challenging datasets and demonstrate the effectiveness of the proposed module.

DOI
01 Dec 2021
TL;DR: The authors distill short narratives of the searchers' information needs into simple, yet precise entity-interaction graph patterns, which provides all information needed for a precise search, and feature variable nodes to flexibly allow for different substitutions of entities taking a specified role.
Abstract: Finding relevant publications in the scientific domain can be quite tedious: Accessing large-scale document collections often means to formulate an initial keyword-based query followed by many refinements to retrieve a sufficiently complete, yet manageable set of documents to satisfy one’s information need. Since keyword-based search limits researchers to formulating their information needs as a set of unconnected keywords, retrieval systems try to guess each user’s intent. In contrast, distilling short narratives of the searchers’ information needs into simple, yet precise entity-interaction graph patterns provides all information needed for a precise search. As an additional benefit, such graph patterns may also feature variable nodes to flexibly allow for different substitutions of entities taking a specified role. An evaluation over the PubMed document collection quantifies the gains in precision for our novel entity-interaction-aware search. Moreover, we perform expert interviews and a questionnaire to verify the usefulness of our system in practice.

Journal ArticleDOI
TL;DR: This work proposes a novel task of discovering sentences for argumentation about the meaning of statutory terms and investigates the feasibility of developing a system that responds to a query with a list of sentences that mention the term in a way that is useful for understanding and elaborating its meaning.
Abstract: In this work we study, design, and evaluate computational methods to support interpretation of statutory terms. We propose a novel task of discovering sentences for argumentation about the meaning of statutory terms. The task models the analysis of past treatment of statutory terms, an exercise lawyers routinely perform using a combination of manual and computational approaches. We treat the discovery of sentences as a special case of ad hoc document retrieval. The specifics include retrieval of short texts (sentences), specialized document types (legal case texts), and, above all, the unique definition of document relevance provided in detailed annotation guidelines. To support our experiments we assembled a data set comprising 42 queries (26,959 sentences) which we plan to release to the public in the near future in order to support further research. Most importantly, we investigate the feasibility of developing a system that responds to a query with a list of sentences that mention the term in a way that is useful for understanding and elaborating its meaning. This is accomplished by a systematic assessment of different features that model the sentences’ usefulness for interpretation. We combine features into a compound measure that accounts for multiple aspects. The definition of the task, the assembly of the data set, and the detailed task analysis provide a solid foundation for employing a learning-to-rank approach.

Journal ArticleDOI
19 Oct 2021
TL;DR: This paper compares deep learning ranking models from the literature along different dimensions to understand the major contributions and limitations of each model, analyzes the promising neural components, and proposes future research directions.
Abstract: Ranking models are the main components of information retrieval systems. Several approaches to ranking are based on traditional machine learning algorithms using a set of hand-crafted features. Recently, researchers have leveraged deep learning models in information retrieval. These models are trained end-to-end to extract features from the raw data for ranking tasks, so that they overcome the limitations of hand-crafted features. A variety of deep learning models have been proposed, and each model presents a set of neural network components to extract features that are used for ranking. In this paper, we compare the proposed models in the literature along different dimensions in order to understand the major contributions and limitations of each model. In our discussion of the literature, we analyze the promising neural components, and propose future research directions. We also show the analogy between document retrieval and other retrieval tasks where the items to be ranked are structured documents, answers, images and videos.

Book ChapterDOI
28 Mar 2021
TL;DR: This paper reproduces the BERT-PLI framework for legal document retrieval, without being able to reproduce the original evaluation results, and investigates it for patent retrieval and cross-domain transfer; transferring the BERT-PLI model on the paragraph level leads to comparable results between both domains, with first promising results for cross-domain transfer on the document level.
Abstract: Domain specific search has always been a challenging information retrieval task due to several challenges such as the domain specific language, the unique task setting, as well as the lack of accessible queries and corresponding relevance judgements. In recent years, pretrained language models such as BERT revolutionized web and news search. Naturally, the community aims to adapt these advancements to cross-domain transfer of retrieval models for domain specific search. In the context of legal document retrieval, Shao et al. propose the BERT-PLI framework by modeling the Paragraph-Level Interactions with the language model BERT. In this paper, we reproduce the original experiments, clarify pre-processing steps and add missing scripts for framework steps; however, we are not able to reproduce the evaluation results. Contrary to the original paper, we demonstrate that the domain specific paragraph-level modelling does not appear to help the performance of the BERT-PLI model compared to paragraph-level modelling with the original BERT. In addition to our legal search reproducibility study, we investigate BERT-PLI for document retrieval in the patent domain. We find that the BERT-PLI model does not yet achieve performance improvements for patent document retrieval compared to the BM25 baseline. Furthermore, we evaluate the BERT-PLI model for cross-domain retrieval between the legal and patent domain on individual components, both on a paragraph and document-level. We find that the transfer of the BERT-PLI model on the paragraph-level leads to comparable results between both domains as well as first promising results for the cross-domain transfer on the document-level. For reproducibility and transparency as well as to benefit the community we make our source code and the trained models publicly available.

Book ChapterDOI
29 Jan 2021
TL;DR: This paper proposes IIMDR, an Intelligence Integration Model for Document Retrieval that uses Jaccard Similarity to calculate word similarity in search result documents, and reports improved performance by providing more relevant documents for the user query.
Abstract: With the rise of digital documents on the internet, meeting the demand for accurate search results has become a challenging task, and many notable algorithms have been used to cater to such queries. However, with the increasingly fast-paced discoveries in machine learning and artificial intelligence, many of the algorithms have become outdated and are no longer used. It is necessary to display only the relevant content that the user queried for. In this paper, IIMDR: Intelligence Integration Model for Document Retrieval is proposed. Jaccard Similarity is used to calculate word similarity in search result documents. Experiments using the proposed model on the RCV1 dataset show that the proposed framework results in improved performance by providing more relevant documents to the user query. An accuracy of 86.78% on the RCV1 dataset has been achieved with this method.
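
Jaccard similarity, which the abstract above applies to words in search result documents, is the size of the intersection of two sets divided by the size of their union. The sketch below ranks toy documents against a query by that measure; how IIMDR combines it with its other components is not stated in the abstract, so this shows the similarity measure only, and the query and documents are made up.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two token collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rank_by_jaccard(query, documents):
    """Return document ids sorted by descending Jaccard similarity to the query."""
    query_tokens = query.split()
    return sorted(documents,
                  key=lambda doc_id: jaccard(query_tokens, documents[doc_id].split()),
                  reverse=True)


# Toy query and documents (hypothetical, for illustration only).
documents = {
    "d1": "football cup final tonight",
    "d2": "oil price rise hits markets",
}
print(rank_by_jaccard("oil price rise", documents))
```

Because the measure works on sets, repeated words and word order have no effect on the score.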

Book ChapterDOI
21 Sep 2021
TL;DR: The CLEF 2021 SimpleText track addresses the opportunities and challenges of text simplification approaches to improve scientific information access, aiming to provide appropriate data and benchmarks, starting with pilot tasks in 2021, and to create a community of NLP and IR researchers working together on this challenge.
Abstract: Information retrieval has moved from traditional document retrieval in which search is an isolated activity, to modern information access where search and the use of the information are fully integrated. But non-experts tend to avoid authoritative primary sources such as scientific literature due to their complex language, internal vernacular, or lacking prior background knowledge. Text simplification approaches can remove some of these barriers, thereby avoiding that users rely on shallow information in sources prioritizing commercial or political incentives rather than the correctness and informational value. The CLEF 2021 SimpleText track addresses the opportunities and challenges of text simplification approaches to improve scientific information access head-on. We aim to provide appropriate data and benchmarks, starting with pilot tasks in 2021, and create a community of NLP and IR researchers working together to resolve one of the greatest challenges of today.

Book ChapterDOI
Xuanang Chen, Ben He, Kai Hui, Le Sun, Yingfei Sun
28 Mar 2021
TL;DR: This paper investigates the effectiveness of two knowledge distillation models on the document ranking task and proposes two simplifications on top of TinyBERT; the resulting Simplified TinyBERT significantly outperforms BERT-Base while providing a 15$\times$ speedup.
Abstract: Despite the effectiveness of utilizing the BERT model for document ranking, the high computational cost of such approaches limits their uses. To this end, this paper first empirically investigates the effectiveness of two knowledge distillation models on the document ranking task. In addition, on top of the recently proposed TinyBERT model, two simplifications are proposed. Evaluations on two different and widely-used benchmarks demonstrate that Simplified TinyBERT with the proposed simplifications not only boosts TinyBERT, but also significantly outperforms BERT-Base when providing 15$\times$ speedup.

Proceedings ArticleDOI
23 Mar 2021
TL;DR: A finer-grained categorization scheme is introduced that sheds more light on the impact of absent keyphrases on scientific document retrieval and how the proposed scheme can offer a new angle to evaluate the output of neural keyphrase generation models.
Abstract: Neural keyphrase generation models have recently attracted much interest due to their ability to output absent keyphrases, that is, keyphrases that do not appear in the source text. In this paper, we discuss the usefulness of absent keyphrases from an Information Retrieval (IR) perspective, and show that the commonly drawn distinction between present and absent keyphrases is not made explicit enough. We introduce a finer-grained categorization scheme that sheds more light on the impact of absent keyphrases on scientific document retrieval. Under this scheme, we find that only a fraction (around 20%) of the words that make up keyphrases actually serves as document expansion, but that this small fraction of words is behind much of the gains observed in retrieval effectiveness. We also discuss how the proposed scheme can offer a new angle to evaluate the output of neural keyphrase generation models.

Journal ArticleDOI
TL;DR: The proposed open-source decision support system (DSS) is effective and can substantially decrease the document retrieval and citation screening steps' workload and error rate.
Abstract: The systematic literature review (SLR) process includes several steps to collect secondary data and analyze it to answer research questions. In this context, the document retrieval and primary study selection steps are heavily intertwined and known for their repetitiveness, high human workload, and difficulty identifying all relevant literature. This study aims to reduce human workload and error of the document retrieval and primary study selection processes using a decision support system (DSS). An open-source DSS is proposed that supports the document retrieval step, dataset preprocessing, and citation classification. The DSS is domain-independent, as it has been shown to assess an article's relevance based solely on the title and abstract. These features can be consistently retrieved from scientific database APIs. Additionally, the DSS is designed to run in the cloud without any required programming knowledge for reviewers. A Multi-Channel CNN architecture is implemented to support the citation screening process. With the provided DSS, reviewers can fill in their search strategy and manually label only a subset of the citations. The remaining unlabeled citations are automatically classified and sorted based on probability. It was shown that for four out of five review datasets, the use of the DSS achieved significant workload savings of at least 10%. The cross-validation results show that the system provides consistent results, with up to 88.3% of work saved during citation screening. In two cases, our model yielded a better performance over the benchmark review datasets. As such, the proposed approach can assist the development of systematic literature reviews independent of the domain. The proposed DSS is effective and can substantially decrease the workload and error rate of the document retrieval and citation screening steps.

Journal ArticleDOI
TL;DR: A two-stage information retrieval system based on an interactive multimodal genetic algorithm (IMGA) for a query weight optimization system that outperforms several state-of-the-art query weight optimization approaches in terms of the precision rate and the recall rate.
Abstract: Query weight optimization, which aims to find an optimal combination of the weights of query terms for sorting relevant documents, is an important topic in information retrieval. Due to the huge search space, the query optimization problem is intractable, and evolutionary algorithms have become a popular approach. But as the size of the database grows, traditional retrieval approaches may return a large number of results, which leads to low efficiency and poor practicality. To solve this problem, this paper proposes a two-stage information retrieval system based on an interactive multimodal genetic algorithm (IMGA) for a query weight optimization system. The proposed IMGA has two stages: quantity control and quality optimization. In the quantity control stage, a multimodal genetic algorithm aided by a niching method simultaneously selects multiple promising combinations of query terms, which keeps the number of retrieved documents within an appropriate range. In the quality optimization stage, an interactive genetic algorithm is designed to find the optimal query weights so that the most user-friendly document retrieval sequence can be produced. Users' feedback accelerates the optimization process, and a genetic algorithm (GA) interacts with the relevance feedback mechanism. In place of user evaluation, a mathematical model is built to evaluate the fitness values of individuals. With the proposed two-stage method, not only can the number of returned results be controlled, but the quality and accuracy of retrieval can also be improved. The proposed method is run on a database with more than 2000 documents. The experimental results show that our proposed method outperforms several state-of-the-art query weight optimization approaches in terms of the precision rate and the recall rate.
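
To make the idea of evolving query term weights concrete, here is a minimal sketch of a plain genetic algorithm. It is not the IMGA from the abstract: it has no niching, no quantity control stage, and no interactive feedback. The scoring function (a weighted sum of term frequencies), the fitness function (precision at k against known relevance labels), and all names and parameters are assumptions made for illustration.

```python
import random


def score(doc_tf, weights):
    """Weighted sum of query-term frequencies in one document."""
    return sum(w * doc_tf.get(term, 0) for term, w in weights.items())


def precision_at_k(docs, relevant, weights, k):
    """Fraction of the top-k documents (ranked by score) that are relevant."""
    ranked = sorted(docs, key=lambda d: score(docs[d], weights), reverse=True)
    return sum(1 for d in ranked[:k] if d in relevant) / k


def evolve_weights(docs, relevant, terms, k=2, pop_size=20, generations=30, seed=0):
    """Search for query-term weights in [0, 1] that maximize precision at k."""
    rng = random.Random(seed)

    def fitness(weights):
        return precision_at_k(docs, relevant, weights, k)

    # Start from uniform weights plus random individuals.
    population = [{t: 1.0 for t in terms}]
    population += [{t: rng.random() for t in terms} for _ in range(pop_size - 1)]

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = {t: rng.choice((a[t], b[t])) for t in terms}  # uniform crossover
            t = rng.choice(terms)  # mutate one weight, clamped to [0, 1]
            child[t] = min(1.0, max(0.0, child[t] + rng.gauss(0, 0.2)))
            children.append(child)
        population = parents + children

    return max(population, key=fitness)
```

Because the best individuals are carried over unchanged each generation, the returned weights are never worse than the uniform starting point under this fitness function.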

Journal ArticleDOI
TL;DR: The proposed system performs spoken document retrieval for political speeches delivered in a variety of environments; it develops a speech recognition system and introduces a novel indexing scheme based on wavelet trees for retrieving data.
Abstract: Spoken document retrieval for a specific context is an active and interesting area of research. It lets users search through archives of speech data, which is too time-consuming and expensive to do manually. In this article, we focus on spoken document retrieval for political speeches delivered in a variety of environments. The technique takes an archive of spoken documents (audio files) as input and performs automatic speech recognition (ASR) on it to derive textual transcripts, using deep neural networks (DNN), hidden Markov models (HMM), and Gaussian mixture models (GMM). These transcripts are then pruned for indexing by applying certain pre-processing techniques. Thereafter, it builds a time- and space-efficient index of the documents using wavelet trees for retrieval. The constructed index is searched to count the occurrences of each word in the user's query. These counts are then used to calculate term frequency-inverse document frequency (TF-IDF) scores, and the similarity of the query to each document is calculated using cosine similarity. Finally, the documents are ranked by these scores in order of relevance. Thus, the proposed system develops a speech recognition system and introduces a novel indexing scheme based on wavelet trees for retrieving data.
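
The final ranking stage described above (term counts, then TF-IDF scores, then cosine similarity) can be sketched in a few lines. This is only an illustration under stated assumptions: a plain `Counter` stands in for the wavelet tree index, the transcripts are assumed to be already tokenized, and the IDF formula is one common variant, since the abstract does not give the exact weighting.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # One common IDF variant; the exact formula used in the paper is not stated.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_tokens, docs):
    """Return document indices ordered from most to least similar to the query."""
    vectors, idf = tfidf_vectors(docs)
    query = {t: c * idf[t] for t, c in Counter(query_tokens).items() if t in idf}
    scores = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda si: -si[0])]
```

Query terms that never occur in the collection are dropped, so a query made up only of unseen words scores zero against every document.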

Journal ArticleDOI
TL;DR: Error analysis suggests that the phrasal features are particularly useful for classifying four groups of genre classes, i.e. unscripted speech, fiction, news reports, and academic writing, all distributed with distinct structural characteristics, and they demonstrate an incremental degree of formality in the continuum of language complexity.
Abstract: Genre characterizes a document differently from a subject that has been the focus of most document retrieval and classification applications. This work hypothesizes a close interaction between synt...

Proceedings ArticleDOI
01 Apr 2021
TL;DR: CovRelex is a scientific paper retrieval system targeting entities and relations via relation extraction on COVID-19 scientific papers; it aims to help users efficiently acquire knowledge across the large number of rapidly published COVID-19 scientific papers.
Abstract: This paper presents CovRelex, a scientific paper retrieval system targeting entities and relations via relation extraction on COVID-19 scientific papers. This work aims to build a system that helps users efficiently acquire knowledge across the large number of rapidly published COVID-19 scientific papers. Our system can be accessed via https://www.jaist.ac.jp/is/labs/nguyen-lab/systems/covrelex/. © 2021 Association for Computational Linguistics

Proceedings ArticleDOI
19 Apr 2021
TL;DR: The authors conducted an empirical study on relevance modeling in three representative IR tasks, i.e., document retrieval, answer retrieval, and response retrieval, to investigate how to leverage different modeling focuses of relevance to improve these IR tasks.
Abstract: Relevance plays a central role in information retrieval (IR) and has been studied extensively since the 20th century. The definition and modeling of relevance have always been critical challenges in both information science and computer science. Along with the debate and exploration on relevance, IR has already become a core task in many real-world applications, such as Web search engines, question answering systems, and conversational bots. While relevance acts as a unified concept in all these retrieval tasks, its inherent definitions are quite different due to the heterogeneity of the tasks. This raises a question: Do these different forms of relevance really lead to different modeling focuses? To answer this question, we conduct an empirical study on relevance modeling in three representative IR tasks, i.e., document retrieval, answer retrieval, and response retrieval. Specifically, we attempt to study the following two questions: 1) Does relevance modeling in these tasks really show differences in terms of natural language understanding (NLU)? We employ 16 linguistic tasks to probe a unified retrieval model over these three retrieval tasks to answer this question. 2) If differences do exist, how can we leverage the findings to enhance relevance modeling? We propose three intervention methods to investigate how to leverage different modeling focuses of relevance to improve these IR tasks. We believe that both the way we study the problem and our findings will be beneficial to the IR community.

Journal ArticleDOI
TL;DR: A novel principled approach to passage-based (document) retrieval using fuzzy set theory is presented, and the results show that document length normalization alone is not sufficient, especially in pseudo-relevance feedback retrieval.
Abstract: In this article, we present a novel principled approach to passage-based (document) retrieval using fuzzy set theory. The approach formulates passage score combination according to general relevance decision principles. By operationalizing these principles using aggregation operators of fuzzy set theory, our approach justifies the common heuristic of taking the maximum constituent passage score as the overall document score. Experiments show that this heuristic is only near-best, with some fuzzy set aggregation operators stipulated in our approach performing better. The significance of our principled approach is that it makes many passage score combination methods applicable, potentially bringing further performance enhancement. Experiments on several Text REtrieval Conference (TREC) collections demonstrate that our approach performs significantly better than document-based retrieval. While recent works in the literature mostly employ document-based rather than passage-based retrieval, due to the common conception that document length normalization solves the problem of varying document lengths, our results show that document length normalization alone is not sufficient, especially in pseudo-relevance feedback retrieval.
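
The passage score combination discussed above can be illustrated with two standard fuzzy union operators. This is a sketch under assumptions: passage scores are taken to be already normalized to [0, 1], and the probabilistic sum is chosen here only as a familiar alternative to the maximum; the abstract does not name which aggregation operators the authors found to be better.

```python
from functools import reduce


def max_score(passage_scores):
    """Common heuristic: the document score is its best passage score."""
    return max(passage_scores)


def probabilistic_sum(passage_scores):
    """Fuzzy union via the probabilistic sum: 1 - prod(1 - s)."""
    return 1.0 - reduce(lambda acc, s: acc * (1.0 - s), passage_scores, 1.0)


def rank_documents(doc_passages, aggregate):
    """Order document ids by their aggregated passage score, best first."""
    return sorted(doc_passages, key=lambda d: aggregate(doc_passages[d]), reverse=True)
```

The two operators can disagree: a document with passage scores 0.6 and 0.1 beats one with 0.5 and 0.5 under the maximum, but loses under the probabilistic sum, which rewards several moderately relevant passages.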

Book ChapterDOI
28 Mar 2021
TL;DR: The authors proposed Contextualized Embeddings for Query Expansion (CEQE) that utilizes query-focused contextualized embedding vectors for query expansion in ad-hoc document retrieval.
Abstract: In this work we leverage recent advances in context-sensitive language models to improve the task of query expansion. Contextualized word representation models, such as ELMo and BERT, are rapidly replacing static embedding models. We propose a new model, Contextualized Embeddings for Query Expansion (CEQE), that utilizes query-focused contextualized embedding vectors. We study the behavior of contextual representations generated for query expansion in ad-hoc document retrieval. We conduct our experiments on probabilistic retrieval models as well as in combination with neural ranking models. We evaluate CEQE on two standard TREC collections: Robust and Deep Learning. We find that CEQE outperforms static embedding-based expansion methods on multiple collections (by up to 18% on Robust and 31% on Deep Learning on average precision) and also improves over proven probabilistic pseudo-relevance feedback (PRF) models. We further find that multiple passes of expansion and reranking result in continued gains in effectiveness with CEQE-based approaches outperforming other approaches. The final model incorporating neural and CEQE-based expansion score achieves gains of up to 5% in P@20 and 2% in AP on Robust over the state-of-the-art transformer-based re-ranking model, Birch.
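
The two effectiveness measures quoted above, P@20 and average precision (AP), follow standard definitions and can be sketched as below. The function names and the toy ranking are illustrative assumptions; nothing here reproduces the CEQE model or its reported numbers.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document appears.

    Relevant documents that are never retrieved contribute zero, because the
    sum is divided by the total number of relevant documents.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)
```

For a ranking `["d3", "d1", "d7", "d2"]` with relevant set `{"d1", "d2", "d9"}`, the relevant documents appear at ranks 2 and 4, so AP averages 1/2 and 2/4 over three relevant documents.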
